Las enfermedades del análisis fueron Norovirus, Salmonella spp, Escherichia coli, Campylobacter spp, Shigella spp, Clostridium spp, Staphylococcus spp, Yersinia spp, Bacillus spp, otros virus y otros microorganismos. Las verduras eran Ensalada (todos los productos relacionados con la ensalada), Hojas (todos los productos relacionados con las hojas), Tomate y Otras verduras. Las frutas fueron Germinados (todos los productos relacionados con los germinados), Bayas, Melón, Zumos y Otras frutas. Los datos se encuentran en el archivo vegetables.zip. Aunque se pueden cargar desde el archivo vegetables.mat de Matlab, se recomienda leerlos desde los otros archivos. El archivo vegetables.dat contiene tres columnas de datos. La primera columna es la variable illness (enfermedad), la segunda es la variable veg.fruit (verduras/frutas). Y, por último, la tercera columna es la variable region (región). Los nombres de todos los elementos que necesitamos se hallan en el archivo vegetables_labels.txt de forma ordenada según la codificación.
library(readr)
## Warning: package 'readr' was built under R version 4.1.1
vegetables <- read_table2("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/vegetables.dat",
col_names = FALSE)
## Warning: `read_table2()` was deprecated in readr 2.0.0.
## Please use `read_table()` instead.
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_double(),
## X2 = col_double(),
## X3 = col_double()
## )
vegetables_labels <- read_table2("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/vegetables_labels.txt",
col_names = FALSE)
## Warning: `read_table2()` was deprecated in readr 2.0.0.
## Please use `read_table()` instead.
##
## -- Column specification --------------------------------------------------------
## cols(
## X1 = col_character()
## )
lista_vegetables<-as.data.frame(vegetables)
#lista_vegetables<-matrix(lista_vegetables,ncol = 3)
#tabla_vegetables <- matrix(lista_vegetables, ncol=3)
vegetables_eu<-lista_vegetables[1:197,]
vegetables_usa<-lista_vegetables[198:561,]
vegetables_usa$X1 = factor(vegetables_usa$X1,levels = c(1:11))
tabla_eu<-table(vegetables_eu)
tabla_usa<-table(vegetables_usa)
#agregamos las etiquetas para EU
rownames(tabla_eu)<-c("Norovirus","Salmonella","E-coli","Campylobacter","Shigella"
,"Clostridium","Staphylococcus","Yersinia","Bacillus","Other-viruses","Other-microorganisms")
colnames(tabla_eu)<-c("Salad","Leafy","Tomato","Other-vegetables","Sprouts","Berries","Melon","Juices","Other-fruits")
#agregamos la etiqueta para USA
rownames(tabla_usa)<-c("Norovirus","Salmonella","E-coli","Campylobacter","Shigella"
,"Clostridium","Staphylococcus","Yersinia","Bacillus","Other-viruses","Other-microorganisms")
colnames(tabla_usa)<-c("Salad","Leafy","Tomato","Other-vegetables","Sprouts","Berries","Melon","Juices","Other-fruits")
#finalmente obtenemos las tablas que necesitamos
tabla_eu
## , , X3 = 1
##
## X2
## X1 Salad Leafy Tomato Other-vegetables Sprouts Berries
## Norovirus 15 26 1 9 0 55
## Salmonella 8 12 1 3 11 0
## E-coli 3 0 0 1 3 0
## Campylobacter 2 1 0 0 0 0
## Shigella 1 0 0 3 0 0
## Clostridium 0 1 0 6 0 0
## Staphylococcus 0 0 0 3 1 0
## Yersinia 0 0 0 3 0 0
## Bacillus 2 0 0 3 0 1
## Other-viruses 2 0 0 0 2 0
## Other-microorganisms 0 0 0 9 0 0
## X2
## X1 Melon Juices Other-fruits
## Norovirus 0 0 2
## Salmonella 1 1 3
## E-coli 0 0 0
## Campylobacter 0 0 0
## Shigella 0 0 1
## Clostridium 0 0 0
## Staphylococcus 0 0 0
## Yersinia 0 0 0
## Bacillus 0 0 0
## Other-viruses 0 0 1
## Other-microorganisms 0 0 0
tabla_usa
## , , X3 = 2
##
## X2
## X1 Salad Leafy Tomato Other-vegetables Sprouts Berries
## Norovirus 97 62 5 9 0 5
## Salmonella 8 8 17 3 14 2
## E-coli 10 22 0 0 4 2
## Campylobacter 4 2 1 0 0 0
## Shigella 1 2 0 0 0 0
## Clostridium 0 0 0 0 0 0
## Staphylococcus 2 0 0 0 0 0
## Yersinia 0 0 0 0 0 0
## Bacillus 1 0 0 0 0 0
## Other-viruses 0 1 1 1 0 0
## Other-microorganisms 0 0 0 0 2 0
## X2
## X1 Melon Juices Other-fruits
## Norovirus 9 3 33
## Salmonella 14 0 5
## E-coli 0 6 2
## Campylobacter 1 0 1
## Shigella 0 0 0
## Clostridium 0 0 0
## Staphylococcus 0 0 0
## Yersinia 0 0 0
## Bacillus 0 0 1
## Other-viruses 0 0 2
## Other-microorganisms 1 0 0
En este caso s epuede observar que las principalmente encontramos relacionada el norovirus con berries, en menor medida con "salad" y "leafy"
chisq <- chisq.test(dt)
## Warning in chisq.test(dt): Chi-squared approximation may be incorrect
chisq
##
## Pearson's Chi-squared test
##
## data: dt
## X-squared = 211.84, df = 80, p-value = 7.245e-14
En este caso vemos que las variables estan asociadas estadisticamente significativas, cuyo p-value es de 7.245e-14.
library(FactoMineR)
datos.ca <- CA(table_eu2, graph = FALSE)
plot.CA(datos.ca)
library("FactoMineR")
res.ca <- CA(dt, graph = FALSE)
res.ca
## **Results of the Correspondence Analysis (CA)**
## The row variable has 11 categories; the column variable has 9 categories
## The chi square of independence between the two variables is equal to 211.8389 (p-value = 7.245277e-14 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
eig.val <- get_eigenvalue(res.ca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.010216e-01 4.659260e+01 46.59260
## Dim.2 4.251785e-01 3.953956e+01 86.13216
## Dim.3 6.851043e-02 6.371141e+00 92.50330
## Dim.4 4.237996e-02 3.941132e+00 96.44443
## Dim.5 3.540674e-02 3.292657e+00 99.73709
## Dim.6 2.820738e-03 2.623151e-01 99.99940
## Dim.7 6.403824e-06 5.955248e-04 100.00000
## Dim.8 5.362702e-33 4.987055e-31 100.00000
fviz_screeplot(res.ca, addlabels = TRUE, ylim = c(0, 50))
Observamos que los dos primeros dimensiones representan el 86.1 de la varianza,
fviz_ca_biplot(res.ca, repel = TRUE)
fviz_ca_biplot(res.ca,
map ="rowprincipal", arrow = c(TRUE, TRUE),
repel = TRUE)
Ahora hacemos lo mismo para la tabla de USA
library(gplots)
#convertimos en array los datos
table_usa2<-as.data.frame.array(tabla_usa)
du <- as.table(as.matrix(table_usa2))
balloonplot(t(du), main ="Tabla datos USA", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Observamos que la enfermedad de norovirus se encuentra relacionada principalmente con "salad" y "leafy" principalmente, en la region de USA.
chisq2 <- chisq.test(du)
## Warning in chisq.test(du): Chi-squared approximation may be incorrect
chisq2
##
## Pearson's Chi-squared test
##
## data: du
## X-squared = NaN, df = 80, p-value = NA
library(FactoMineR)
datos.usac <- CA(table_usa2, graph = FALSE)
## Warning in CA(table_usa2, graph = FALSE): The rows Clostridium, Yersinia sum at
## 0. They were suppressed from the analysis
plot.CA(datos.usac)
library("FactoMineR")
res.ca2 <- CA(du, graph = FALSE)
## Warning in CA(du, graph = FALSE): The rows Clostridium, Yersinia sum at 0. They
## were suppressed from the analysis
res.ca2
## **Results of the Correspondence Analysis (CA)**
## The row variable has 9 categories; the column variable has 9 categories
## The chi square of independence between the two variables is equal to 218.1692 (p-value = 1.056517e-18 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
library("factoextra")
eig.val <- get_eigenvalue(res.ca2)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.3838634503 64.04492590 64.04493
## Dim.2 0.1386238174 23.12841221 87.17334
## Dim.3 0.0413729803 6.90279175 94.07613
## Dim.4 0.0223445682 3.72803457 97.80416
## Dim.5 0.0077707397 1.29649345 99.10066
## Dim.6 0.0034423270 0.57432813 99.67499
## Dim.7 0.0014711407 0.24544951 99.92044
## Dim.8 0.0004768823 0.07956448 100.00000
fviz_screeplot(res.ca2, addlabels = TRUE, ylim = c(0, 50))
En este caso la primera dimension reprsenta un poco mas del 50%, y con la segun explica pcoo mas del 70%.
fviz_ca_biplot(res.ca2, repel = TRUE)
fviz_ca_biplot(res.ca2,
map ="rowprincipal", arrow = c(TRUE, TRUE),
repel = TRUE)
library(matlib)
## Warning: package 'matlib' was built under R version 4.1.1
library(readr)
monk_84 <- read_csv("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/monk_84.dis",
col_names = FALSE)
## Rows: 107 Columns: 1
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): X1
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#eliminamos las filas que no necesitamos
monk_84 = monk_84[-1:-2,]
monk_84 = monk_84[-92:-105,]
View(monk_84)
monk_85 <- read_csv("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/monk_85.dis",
col_names = FALSE)
## Rows: 107 Columns: 1
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): X1
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
monk_85 = monk_85[-1:-2,]
monk_85 = monk_85[-92:-105,]
View(monk_85)
#ahora creamos la matriz
matriz1<-symMat(monk_84$X1, diag = FALSE, byrow =TRUE, names = TRUE)
matriz2<-symMat(monk_85$X1,diag = FALSE,byrow = TRUE,names = TRUE)
rownames(matriz1) <- c("ALFA","FRAN","FELL","PANC","ISA","GILD","BETI","OLGA","ORSE","ROSS","DIVO","CIST","ELET","EVA")
matriz1<-as.dist(matriz1)
rownames(matriz2) <- c("ALFA","FRAN","FELL","PANC","ISA","GILD","BETI","OLGA","ORSE","ROSS","DIVO","CIST","ELET","EVA")
matriz2<-as.dist(matriz2)
matriz1
## ALFA FRAN FELL PANC ISA GILD BETI OLGA ORSE ROSS DIVO CIST ELET
## FRAN 126
## FELL 55 367
## PANC 55 63 37
## ISA 27 96 62 27
## GILD 4 3 22 211 60
## BETI 39 39 74 33 90 30
## OLGA 18 10 19 5 26 21 17
## ORSE 73 24 68 8 51 100 38 16
## ROSS 12 23 32 2 8 51 100 38 16
## DIVO 46 1 11 28 14 20 10 1 0 8
## CIST 3 3 26 1 17 3 35 9 3 24 35
## ELET 8 3 7 1 6 8 5 17 16 6 42 28
## EVA 39 4 46 0 3 4 16 16 62 9 6 26 436
matriz2
## ALFA FRAN FELL PANC ISA GILD BETI OLGA ORSE ROSS DIVO CIST ELET
## FRAN 80
## FELL 44 156
## PANC 49 109 77
## ISA 83 53 33 20
## GILD 22 286 59 161 120
## BETI 18 117 113 105 153 44
## OLGA 29 22 7 17 27 10 22
## ORSE 23 9 105 7 28 20 27 256
## ROSS 21 4 40 9 41 11 75 9 7
## DIVO 18 2 78 7 25 4 20 4 2 12
## CIST 11 74 47 2 44 56 41 38 18 69 14
## ELET 60 107 7 0 4 16 14 28 14 146 9 39
## EVA 62 3 96 0 15 15 20 11 109 12 34 22 204
#podemos comprobar si es son conjuntos
library(ade4)
## Warning: package 'ade4' was built under R version 4.1.1
##
## Attaching package: 'ade4'
## The following object is masked from 'package:FactoMineR':
##
## reconst
#como tenemos un error porque en nuestra matriz hay ceros, debemos cambiar dicho numeros, existan varias maneras de hacerlo, pero una de ella seria reemplazar los ceros por el valor mas proximo que sea valida, en este caso podria ser reemplazados por un 1 o por 1/5 del valor minimo de a matriz, en este caso el minimo es 2, por lo tanto lo reemplazamos los 0 por un 2/5.
monk_85b = monk_85[-1:-2,]
monk_85b = monk_85[-92:-105,]
monk_85b = as.numeric(monk_85b$X1)
monk_85b[70] <- 0.4
monk_85b[82] <- 0.4
matriz2b<-symMat(monk_85b, diag = FALSE, byrow =TRUE, names = TRUE)
is.euclid(as.dist(matriz2b),print=T)
## [1] 4.452048e+04 3.394026e+04 2.308196e+04 1.398872e+04 3.343626e+03
## [6] 2.417038e+03 1.313892e+03 6.088755e-12 -1.266706e+03 -2.984402e+03
## [11] -3.780130e+03 -1.479585e+04 -2.559632e+04 -3.969348e+04
## [1] FALSE
En este caso obtenemos que el conjunto no es euclideo, ya que hay valores propios que son negativos
#realizamos lo mismo para la otra matriz, aqui el minimo es 1 asi que reemplazamos el 0 por el 1/5.
monk_84b = monk_84[-1:-2,]
monk_84b = monk_84[-92:-105,]
monk_84b = as.numeric(monk_84b$X1)
monk_84b[54] <- 0.2
monk_84b[82] <- 0.2
matriz1b<-symMat(monk_84b, diag = FALSE, byrow =TRUE, names = TRUE)
is.euclid(as.dist(matriz1b),print=T)
## [1] 9.508056e+04 6.777172e+04 2.294446e+04 6.347669e+03 2.572985e+03
## [6] 6.604177e+02 1.962426e+02 -1.140057e-11 -2.058370e+02 -1.280993e+03
## [11] -3.415017e+03 -1.605889e+04 -5.215826e+04 -8.694962e+04
## [1] FALSE
En este caso obtenemos que el conjunto no es euclideo, ya que hay valores propios que son negativos.
set.seed(123)
library(MASS)
library(vegan)
## Warning: package 'vegan' was built under R version 4.1.1
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.1.1
## Loading required package: lattice
## This is vegan 2.5-7
data.dist1 <- vegdist(matriz1, "bray")
nmds.iso1 <- isoMDS(data.dist1, trace = FALSE)
nmds.iso1
## $points
## [,1] [,2]
## ALFA 0.07236186 0.31211658
## FRAN 0.98797935 -0.03017324
## FELL 0.37575707 0.31060222
## PANC 0.51577959 -0.48838776
## ISA 0.18038530 0.11130961
## GILD 0.32738531 0.68110878
## BETI 0.30685778 -0.02141391
## OLGA -0.12234328 -0.47209839
## ORSE 0.30125660 -0.26746976
## ROSS -0.23917323 0.02641705
## DIVO -0.16054814 -0.69616387
## CIST -0.42440734 -0.35365123
## ELET -1.57121705 0.08406493
## EVA -0.55007383 0.80373900
##
## $stress
## [1] 17.05513
matriz1<-symMat(monk_84$X1, diag = FALSE, byrow =TRUE, names = TRUE)
rownames(matriz1) <- c("ALFA","FRAN","FELL","PANC","ISA","GILD","BETI","OLGA","ORSE","ROSS","DIVO","CIST","ELET","EVA")
plot(nmds.iso1$points[,1], nmds.iso1$points[,2], type = "n",
xlab="NMDS1", ylab="NMDS2")
text(nmds.iso1$points[,1], nmds.iso1$points[,2], labels = rownames(matriz1), cex=0.8)
set.seed(123)
nmds.meta1 <- metaMDS(matriz1, distance="bray",engine = "monoMDS")
## Run 0 stress 0.2731383
## Run 1 stress 0.2592015
## ... New best solution
## ... Procrustes: rmse 0.1774659 max resid 0.425773
## Run 2 stress 0.2578706
## ... New best solution
## ... Procrustes: rmse 0.193138 max resid 0.3378181
## Run 3 stress 0.266418
## Run 4 stress 0.2540406
## ... New best solution
## ... Procrustes: rmse 0.2326362 max resid 0.4392052
## Run 5 stress 0.2511268
## ... New best solution
## ... Procrustes: rmse 0.2167948 max resid 0.451331
## Run 6 stress 0.2846127
## Run 7 stress 0.2568324
## Run 8 stress 0.2511267
## ... New best solution
## ... Procrustes: rmse 0.0001355036 max resid 0.0002444486
## ... Similar to previous best
## Run 9 stress 0.2617639
## Run 10 stress 0.2866009
## Run 11 stress 0.2663672
## Run 12 stress 0.2927544
## Run 13 stress 0.2829919
## Run 14 stress 0.2785755
## Run 15 stress 0.265869
## Run 16 stress 0.2761516
## Run 17 stress 0.2766908
## Run 18 stress 0.2704178
## Run 19 stress 0.2585522
## Run 20 stress 0.266845
## *** Solution reached
nmds.meta1$stress*100
## [1] 25.11267
El estres en este caso es del 25% lo que se consideraria alto.
plot(nmds.meta1, type="text")
## species scores not available
En este caso vemos que la segun forma es la mejor, los niveles de estress tienden a aumentar de igual manera, y es mayor en este segundo caso que con los primeros resultados obtenidos.
Ahora aplicamos lo mismo para la segunda matriz
set.seed(123)
library(MASS)
library(vegan)
data.dist2 <- vegdist(matriz2, "bray")
nmds.iso2 <- isoMDS(data.dist2, trace = FALSE)
nmds.iso2
## $points
## [,1] [,2]
## ALFA 0.02813951 0.201603302
## FRAN 0.03493319 -0.691537943
## FELL 0.36884823 -0.164862669
## PANC 0.23031062 -0.367514903
## ISA 0.25199638 -0.461929476
## GILD 0.66639014 -0.096017081
## BETI 0.38617222 0.121345965
## OLGA -0.90782217 -0.003730728
## ORSE -0.18994787 0.817631077
## ROSS -0.40940982 -0.370391361
## DIVO -0.50615524 0.544667872
## CIST 0.07340798 0.115641603
## ELET 0.57574512 0.576651969
## EVA -0.60260828 -0.221557628
##
## $stress
## [1] 17.92049
matriz2<-symMat(monk_85$X1, diag = FALSE, byrow =TRUE, names = TRUE)
rownames(matriz2) <- c("ALFA","FRAN","FELL","PANC","ISA","GILD","BETI","OLGA","ORSE","ROSS","DIVO","CIST","ELET","EVA")
plot(nmds.iso2$points[,1], nmds.iso2$points[,2], type = "n",
xlab="NMDS1", ylab="NMDS2")
text(nmds.iso2$points[,1], nmds.iso2$points[,2], labels = rownames(matriz2), cex=0.8)
set.seed(123)
nmds.meta2 <- metaMDS(matriz2, distance="bray",engine = "monoMDS")
## Run 0 stress 0.2649616
## Run 1 stress 0.2572122
## ... New best solution
## ... Procrustes: rmse 0.2311789 max resid 0.4898662
## Run 2 stress 0.3191461
## Run 3 stress 0.2761081
## Run 4 stress 0.2519565
## ... New best solution
## ... Procrustes: rmse 0.1709096 max resid 0.359803
## Run 5 stress 0.2785552
## Run 6 stress 0.2761659
## Run 7 stress 0.2636347
## Run 8 stress 0.2735965
## Run 9 stress 0.2638952
## Run 10 stress 0.2538297
## Run 11 stress 0.2898723
## Run 12 stress 0.2538296
## Run 13 stress 0.3266821
## Run 14 stress 0.2698129
## Run 15 stress 0.2700308
## Run 16 stress 0.2556665
## Run 17 stress 0.2905173
## Run 18 stress 0.2692357
## Run 19 stress 0.2831034
## Run 20 stress 0.2779129
## *** No convergence -- monoMDS stopping criteria:
## 18: stress ratio > sratmax
## 2: scale factor of the gradient < sfgrmin
nmds.meta2$stress*100
## [1] 25.19565
El estres es del 25%, ademas vemos que con este metodo tiende a aumentar al 25%.
plot(nmds.meta2, type="text")
## species scores not available
Tambien obtenemos que la mejor manera seria esta ultima, el stress aumenta a un 25%, lo cual se consideraria alto para este caso.
monkeys<-read_csv("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/PEC1_multivariante/monkeys.csv",
col_names = FALSE)
## Rows: 15 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): X2, X3, X4, X5
## dbl (1): X1
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
monkeys <- read_csv("monkeys.csv")
## New names:
## * `` -> ...1
## Rows: 14 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): name, alias, age, sex
## dbl (1): ...1
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#View(monkeys)
#graficamos para la epoca fuera de la cria
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v purrr 0.3.4 v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'tidyr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(ggrepel)
plot.nmds1<-as.data.frame(nmds.meta1$points)
plot.nmds1$sex = as.factor(monkeys$sex)
p1 <- ggplot(plot.nmds1, aes(plot.nmds1$MDS1, plot.nmds1$MDS2)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p1 + geom_label_repel(aes(label = rownames(plot.nmds1),
fill = factor(plot.nmds1$sex)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
## Warning: Use of `plot.nmds1$MDS1` is discouraged. Use `MDS1` instead.
## Warning: Use of `plot.nmds1$MDS2` is discouraged. Use `MDS2` instead.
## Warning: Use of `plot.nmds1$sex` is discouraged. Use `sex` instead.
## Warning: Use of `plot.nmds1$MDS1` is discouraged. Use `MDS1` instead.
## Warning: Use of `plot.nmds1$MDS2` is discouraged. Use `MDS2` instead.
Ahora, realizamos un plot para la segunda temporada
#graficamos para la temporada de la cria
library(tidyverse)
library(ggrepel)
plot.nmds2<-as.data.frame(nmds.meta2$points)
plot.nmds2$sex = monkeys$sex
p2 <- ggplot(plot.nmds2, aes(plot.nmds2$MDS1, plot.nmds2$MDS2)) +
geom_point(color = 'red') +
theme_classic(base_size = 10)
p2 + geom_label_repel(aes(label = rownames(plot.nmds2),
fill = factor(plot.nmds2$sex)), color = 'white',
size = 3.5) +
theme(legend.position = "bottom")
## Warning: Use of `plot.nmds2$MDS1` is discouraged. Use `MDS1` instead.
## Warning: Use of `plot.nmds2$MDS2` is discouraged. Use `MDS2` instead.
## Warning: Use of `plot.nmds2$sex` is discouraged. Use `sex` instead.
## Warning: Use of `plot.nmds2$MDS1` is discouraged. Use `MDS1` instead.
## Warning: Use of `plot.nmds2$MDS2` is discouraged. Use `MDS2` instead.
Despues de observar los graficos, podemos apreciar que para el periodo de cria y no de cria el estress fue del 25% con el mejor metodo, esto podria indicar un mal ajuste. Vemos que para la temporada de "reproduccion" los machos alfa,divo,cist, beti, se acercan mas entre si, mientras que para las "female" tienden a alejarse en de los demas en la epoca de cria.
library(vegan)
proce <- procrustes(nmds.meta1,nmds.meta2)
plot(proce)
proce$X
## NMDS1 NMDS2
## ALFA -158.423370 -147.636207
## FRAN 176.356876 -52.060394
## FELL -229.930037 7.113788
## PANC 20.527586 146.107924
## ISA -151.843758 145.868282
## GILD -40.395470 -186.595117
## BETI 141.918540 -176.311345
## OLGA 60.485891 -7.102577
## ORSE 193.762168 100.465473
## ROSS -74.083020 124.402802
## DIVO 127.873371 24.804719
## CIST 4.871024 -81.669708
## ELET -109.560405 -50.829770
## EVA 38.440604 153.442129
## attr(,"scaled:center")
## NMDS1 NMDS2
## -1.903239e-15 3.172066e-15
os monos como feel,eva y orse, tienden a mantener algunas posiciones relativas, pero muchos de monos no la mantienen, se ven distancia entre la una y otra temporada.
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
set.seed(123)
nmds.meta2 <- metaMDS(matriz2, distance="bray",engine = "monoMDS")
## Run 0 stress 0.2649616
## Run 1 stress 0.2572122
## ... New best solution
## ... Procrustes: rmse 0.2311789 max resid 0.4898662
## Run 2 stress 0.3191461
## Run 3 stress 0.2761081
## Run 4 stress 0.2519565
## ... New best solution
## ... Procrustes: rmse 0.1709096 max resid 0.359803
## Run 5 stress 0.2785552
## Run 6 stress 0.2761659
## Run 7 stress 0.2636347
## Run 8 stress 0.2735965
## Run 9 stress 0.2638952
## Run 10 stress 0.2538297
## Run 11 stress 0.2898723
## Run 12 stress 0.2538296
## Run 13 stress 0.3266821
## Run 14 stress 0.2698129
## Run 15 stress 0.2700308
## Run 16 stress 0.2556665
## Run 17 stress 0.2905173
## Run 18 stress 0.2692357
## Run 19 stress 0.2831034
## Run 20 stress 0.2779129
## *** No convergence -- monoMDS stopping criteria:
## 18: stress ratio > sratmax
## 2: scale factor of the gradient < sfgrmin
nmds.meta2$stress*100
## [1] 25.19565
plot.nmdsf = plot.nmds2
plot.nmdsf$mns3 = plot.nmds2$MDS1
plot.nmdsf$sex = factor(plot.nmdsf$sex)
plot_ly(x = plot.nmdsf$MDS1, y = plot.nmdsf$MDS2, z = nmds.meta2$stress*100,
color = plot.nmdsf$sex)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Observamos que los machos tienden a estar cerca y agruparse, sin embargo hay uno que se aleja de los demas, estos ademas estan mas cerca del nivel de estress maximo.
Ejercicio 3
#cargamos los datos
load("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/2. Analisis de componentes principales/gorriones.RData")
head(gorriones)
gorrionesa<- gorriones[,-6]
gorrionesa<-scale(gorrionesa)
head(gorrionesa)
## x1 x2 x3 x4 x5
## [1,] -0.5417191 0.7248615 0.17718246 0.05424955 -0.32937165
## [2,] -1.0890230 -0.2617555 -1.33272023 -1.00904159 -1.23720227
## [3,] -1.3626749 -0.2617555 -0.57776889 -0.12296564 -0.22850158
## [4,] -1.3626749 -1.0510492 -0.70359411 -1.36347197 -0.63198186
## [5,] -0.8153711 0.3302147 0.05135723 0.23146474 -0.53111179
## [6,] 1.3738444 1.1195083 0.68048336 0.94032550 0.07410862
gorrionesa_pca <- princomp(x = gorrionesa,cor = TRUE)
gorrionesa_pca
## Call:
## princomp(x = gorrionesa, cor = TRUE)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.9015726 0.7290433 0.6216306 0.5491498 0.4056199
##
## 5 variables and 49 observations.
print(summary(gorrionesa_pca), loadings = TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9015726 0.7290433 0.62163056 0.5491498 0.4056199
## Proportion of Variance 0.7231957 0.1063008 0.07728491 0.0603131 0.0329055
## Cumulative Proportion 0.7231957 0.8294965 0.90678139 0.9670945 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## x1 0.452 0.690 0.420 0.374
## x2 0.462 -0.300 0.341 -0.548 -0.530
## x3 0.451 -0.325 -0.454 0.606 -0.343
## x4 0.471 -0.185 -0.411 -0.388 0.652
## x5 0.398 0.876 -0.178 -0.192
gorrionesa_pca$loadings[,1:5]
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## x1 0.4517989 0.05072137 0.6904702 0.42041399 0.3739091
## x2 0.4616809 -0.29956355 0.3405484 -0.54786307 -0.5300805
## x3 0.4505416 -0.32457242 -0.4544927 0.60629605 -0.3427923
## x4 0.4707389 -0.18468403 -0.4109350 -0.38827811 0.6516665
## x5 0.3976754 0.87648935 -0.1784558 -0.06887199 -0.1924341
gorrionesa_pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## x1 0.452 0.690 0.420 0.374
## x2 0.462 -0.300 0.341 -0.548 -0.530
## x3 0.451 -0.325 -0.454 0.606 -0.343
## x4 0.471 -0.185 -0.411 -0.388 0.652
## x5 0.398 0.876 -0.178 -0.192
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.2 0.2 0.2 0.2 0.2
## Cumulative Var 0.2 0.4 0.6 0.8 1.0
#calculamos los vectores propios
(vectores_propios = eigen(cor(gorrionesa))$vectors) # Using eigen() method
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.4517989 0.05072137 0.6904702 0.42041399 -0.3739091
## [2,] -0.4616809 -0.29956355 0.3405484 -0.54786307 0.5300805
## [3,] -0.4505416 -0.32457242 -0.4544927 0.60629605 0.3427923
## [4,] -0.4707389 -0.18468403 -0.4109350 -0.38827811 -0.6516665
## [5,] -0.3976754 0.87648935 -0.1784558 -0.06887199 0.1924341
(vectores_propios.svd = svd(gorrionesa)$v)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4517989 -0.05072137 0.6904702 -0.42041399 0.3739091
## [2,] 0.4616809 0.29956355 0.3405484 0.54786307 -0.5300805
## [3,] 0.4505416 0.32457242 -0.4544927 -0.60629605 -0.3427923
## [4,] 0.4707389 0.18468403 -0.4109350 0.38827811 0.6516665
## [5,] 0.3976754 -0.87648935 -0.1784558 0.06887199 -0.1924341
#cargamos las cargas obtenidas anteriormente con princomp()
gorrionesa_pca$loadings[,1:5]
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## x1 0.4517989 0.05072137 0.6904702 0.42041399 0.3739091
## x2 0.4616809 -0.29956355 0.3405484 -0.54786307 -0.5300805
## x3 0.4505416 -0.32457242 -0.4544927 0.60629605 -0.3427923
## x4 0.4707389 -0.18468403 -0.4109350 -0.38827811 0.6516665
## x5 0.3976754 0.87648935 -0.1784558 -0.06887199 -0.1924341
# SCORES por multiplicacion de matrices
scores = gorrionesa %*% gorrionesa_pca$loadings[,1:5]
head(scores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## [1,] 0.06428901 -0.60083713 -0.1712334 -0.515825561 -0.5487904
## [2,] -2.18031283 -0.44230082 0.4000696 -0.645459959 -0.2310766
## [3,] -1.14556567 0.01925412 -0.6761269 -0.716298164 -0.2088714
## [4,] -2.31106565 0.17199267 -0.3059621 0.149289289 -0.4781034
## [5,] -0.29504203 -0.66520783 -0.4742138 -0.545862110 -0.2444780
## [6,] 1.91626198 -0.59525444 0.6209330 0.006608669 0.2855166
multiplicamos la carga por la matriz original
head(gorrionesa_pca$scores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## [1,] 0.06495523 -0.60706359 -0.1730078 -0.521171046 -0.5544775
## [2,] -2.20290734 -0.44688437 0.4042155 -0.652148842 -0.2334712
## [3,] -1.15743714 0.01945365 -0.6831336 -0.723721141 -0.2110360
## [4,] -2.33501516 0.17377503 -0.3091328 0.150836370 -0.4830580
## [5,] -0.29809954 -0.67210136 -0.4791281 -0.551518863 -0.2470115
## [6,] 1.93612015 -0.60142305 0.6273677 0.006677154 0.2884754
#calculamos la P'P
Dc = t(scores) %*% scores
#obtenemos las varianzas de os componentes principales dividiendo por n
diag(Dc)/length(gorrionesa)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 0.70843657 0.10413141 0.07570767 0.05908222 0.03223396
En este caso obtenemos la misma proporcion de la varianza de los componentes calculados en un apartado anterior, para los componentes uno, dos , tres , cuatro y cinco respectivamente
load("C:/Users/Cristopher/Desktop/certificados/UOC/Bioestadistica y Bioinformatica/Analisis Multivariante/PEC1/SNP.RData")
controls <- rownames(subject.support)[subject.support$cc==0]
pop <- subject.support[controls,"stratum"]
pop.all <- subject.support[,"stratum"]
use <- seq(1, ncol(snps.10), 10)
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
ctl10 <- snps.10[controls, use]
all10 <- snps.10[,use]
X <- as(ctl10, "numeric")
X.all <- as(all10,"numeric")
# 1 opcion
data <- X
for(i in 1:ncol(data)){
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
x.escalado<-scale(data)
#head(x.escalado)
# finalmente calculamos la XX'
xxtrasposicion<-x.escalado %*% t(x.escalado)
head(xxtrasposicion[1:6,1:6])
## jpt.869 jpt.862 jpt.948 ceu.564 ceu.904 ceu.665
## jpt.869 2601.7764 138.9807 192.9046 -283.5339 -414.6340 -235.4016
## jpt.862 138.9807 2775.1951 164.8674 -218.8962 -306.4598 -280.7771
## jpt.948 192.9046 164.8674 2531.5733 -361.6866 -311.3615 -278.4234
## ceu.564 -283.5339 -218.8962 -361.6866 2908.3334 439.0646 329.4606
## ceu.904 -414.6340 -306.4598 -311.3615 439.0646 2935.7316 313.8808
## ceu.665 -235.4016 -280.7771 -278.4234 329.4606 313.8808 2836.9213
#calculamos los valores propios y vectores propios
vectores.valores<-eigen(xxtrasposicion)
head(vectores.valores$values)
## [1] 146789.918 8589.204 8380.229 8236.379 8111.603 7962.776
head(vectores.valores$vectors[1:9,1:9])
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.03935020 0.03850041 -0.001919868 0.04309872 0.0273662439
## [2,] -0.04413015 -0.03497082 0.001042052 -0.01110718 0.0144786438
## [3,] -0.03929913 0.02880677 0.041696012 0.01155320 0.0067127197
## [4,] 0.04627103 -0.01256492 -0.022308638 -0.04047348 0.0003534879
## [5,] 0.04563004 -0.04508699 0.010979326 -0.07650281 -0.0251297985
## [6,] 0.04756201 -0.03498836 0.025033571 0.02927508 -0.0367371928
## [,6] [,7] [,8] [,9]
## [1,] -0.0256503000 0.010736130 -0.002503765 -0.016589996
## [2,] 0.0104443535 -0.022207087 0.004827079 0.013053613
## [3,] -0.0004460095 0.036779641 0.031218534 0.005143489
## [4,] -0.0067936604 0.032443449 -0.021830882 -0.028964318
## [5,] 0.0399223830 -0.009537312 -0.036011995 0.007572611
## [6,] -0.0314892819 0.072053195 0.062506203 0.007894174
par(mfrow=c(2,4))
boxplot(vectores.valores$vectors[,1]~pop)
boxplot(vectores.valores$vectors[,2]~pop)
boxplot(vectores.valores$vectors[,2]~pop)
boxplot(vectores.valores$vectors[,3]~pop)
Observando los graficos, vemos que solo en el primer componente vemos una diferencia entre las poblaciones.
#El ultimo numero es negativo, pero lo podriamos considerar 0, asi que lo cambiamos por un cero
vectores.valores$values[500]<-0
cargas.f = diag(1/sqrt(vectores.valores$values)) %*% t(vectores.valores$vectors)
head(cargas.f[1:9,1:9])
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1.027067e-04 -1.151827e-04 -1.025734e-04 1.207706e-04 0.0001190975
## [2,] 4.154215e-04 -3.773369e-04 3.108266e-04 -1.355761e-04 -0.0004864910
## [3,] -2.097217e-05 1.138312e-05 4.554771e-04 -2.436941e-04 0.0001199355
## [4,] 4.748935e-04 -1.223871e-04 1.273016e-04 -4.459667e-04 -0.0008429644
## [5,] 3.038518e-04 1.607587e-04 7.453241e-05 3.924833e-06 -0.0002790202
## [6,] -2.874486e-04 1.170440e-04 -4.998180e-06 -7.613276e-05 0.0004473879
## [,6] [,7] [,8] [,9]
## [1,] 0.0001241401 -0.0001061971 1.256717e-04 -1.170009e-04
## [2,] -0.0003775263 0.0015892805 -1.918322e-04 -3.020601e-04
## [3,] 0.0002734607 0.0019511547 5.017774e-04 4.590318e-05
## [4,] 0.0003225745 -0.0003274337 1.359441e-03 1.473427e-04
## [5,] -0.0004078990 0.0016594317 -5.193414e-05 5.341013e-04
## [6,] -0.0003528828 -0.0007649088 -4.531283e-04 1.281613e-04
library(tidyverse)
datag <- X.all
for(i in 1:ncol(datag)){
datag[is.na(datag[,i]), i] <- mean(datag[,i], na.rm = TRUE)
}
escaladog<-scale(datag)
#head(x.escalado)
ftrasposicion<- escaladog %*% t(escaladog)
vectores.valoresf<-eigen(ftrasposicion)
fcomponent<-prcomp(ftrasposicion)
plot(fcomponent$x)